ABSTRACT

The central idea in building and maintaining an item
bank is to calibrate all the items onto a "common variable." The
arithmetic involved in the calibration process is presented. It is
recommended that an analysis of fit be done in every application to
verify that the estimates of item difficulties are in fact
sample-free. These procedures are explained. Once an item bank is
built, a common calibration for all items should be established and
routinely checked. Special procedures for adding new items, updating
old items, and dropping obsolete items are described. (BH)

BASIC IDEAS IN ITEM BANKING

Ronald Mead 
MESA Psychometric Laboratory 
Department of Education 

University of Chicago 

The central idea in building and maintaining an item bank
is to "calibrate" all the items onto a "common variable." The
arithmetic involved in the calibration process is well known and
straightforward (Choppin, 1968; Wright and Stone, 1979; Rentz and
Bashaw, 1977; Mead and Kreines, 1978), so I will deal with that
first. The implication of the phrase "common variable" is
that all the items measure the same thing. Establishing
that this is reasonable goes beyond calibration and is normally
called something like "item fit analysis," although "validation"
might be a more appropriate name. I will consider that later.

CALIBRATION 

Calibrating a Single Form 

When all the people take all the items, you have an "item
bank" as soon as you have computed estimates of the difficulties.
There are a number of ways that this can be accomplished, by hand
or by one of several computer programs (e.g., BICAL). In the
process, the origin of the scale is set at the average item
difficulty, but this is only a numeric convenience.

The number associated with an item is the distance from the
center of the form to the item in question. A negative value
indicates that the item is easier than the average and a positive
value indicates that the item is harder than the average.

Calibrating Two Forms 

If the items are in two forms instead of just one, the first 
step is the same: calibrate each form separately. We then have
two banks, each with its mean difficulty set to zero. Combining 
them requires finding the distance between these two origins. For 
this to be possible, the two calibrations must have something in 
common. This can be either common items or common persons. 

To illustrate the idea, consider two forms, A and B, shown
in Figure 1, with a single common item linking them. To give it a
name, let's call it "Item 7" and assume it has an estimated
difficulty of +1.0 in Form A and -0.5 in Form B. In other words, the
distance from the center of Form A to Item 7 is 1 logit and the
distance from Item 7 to the center of Form B is another half
logit. This makes the distance from the center or "origin" of Form
A to the origin of Form B:

     1.0 - (-0.5) = 1.5 logits.

The only sleight of hand is that I was careful to change the sign
of the half logit to show that I was going from the item to the
origin of Form B rather than from the origin to the item.

If there are several common items, then I would work with
their average difficulty, but the logic is unchanged. Similarly, if
there is a group of common people rather than common items, I
would work with their average ability. The basic process is the
same in any case: Find the distance from the first origin to the
point in common and then the distance from the common point to
the second origin.

The sum of these distances is typically referred to as the 
"link" between the two forms (sometimes it is called the "trans- 
lation constant" or the "shift"). The way I have arranged it 
here, it is the amount that should be added to the difficulties 
of all the items in Form B to shift them onto the origin of Form 
A. (There is nothing sacred about that particular origin, how- 
ever; we can shift it to some more convenient point if there is 
any reason to do this.) 
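The arithmetic for a link through common items can be sketched in a few lines of Python. This is only an illustration: the function names and the extra item "item9" are made up for the example, and each form's difficulties are assumed to be stored as a dictionary already centered on its own form.

```python
from statistics import mean

def link_constant(d_a, d_b):
    """Link that shifts Form B difficulties onto the Form A origin.

    d_a, d_b: dicts mapping item name -> difficulty in logits, each
    centered on its own form. Only items found on both forms are used.
    """
    common = sorted(set(d_a) & set(d_b))
    if not common:
        raise ValueError("the two forms share no items")
    # Average difficulty of the common items on A minus their average on B.
    return mean(d_a[k] for k in common) - mean(d_b[k] for k in common)

def shift_form(d, link):
    """Add the link to every difficulty on a form."""
    return {k: v + link for k, v in d.items()}

# Item 7 from the text: +1.0 on Form A, -0.5 on Form B.
form_a = {"item7": 1.0}
form_b = {"item7": -0.5, "item9": 0.25}   # "item9" is a hypothetical extra item
link = link_constant(form_a, form_b)      # 1.0 - (-0.5) = 1.5 logits
form_b_on_a = shift_form(form_b, link)    # item7 lands at 1.0 on the A scale
```

With several common items the same function averages over all of them, which is the "average difficulty" version of the argument.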

The complication remaining is what to do with the pair of
difficulties we now have for each of the common items. Because 
these difficulties were estimated from different data, they will 
never be exactly the same. Unless there is some reason to prefer 
one calibration over the other, a reasonable thing to do is to 
take a weighted average using the inverse square of the standard 
errors of calibration as the weights. This weighs each estimate 
by the amount of information it contains and takes account of 
both how large and how relevant each sample is. The inverse 
square root of the sum of weights is then the standard error for 
the pooled estimate. 
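A minimal sketch of this pooling follows. It assumes both estimates have already been shifted onto the same origin, and the function name and the numbers in the example are illustrative only.

```python
import math

def pool_estimates(estimates):
    """Pool several difficulty estimates of one item.

    estimates: list of (difficulty, standard_error) pairs, all expressed
    on the same origin. Returns (pooled difficulty, pooled standard error).
    """
    # Each weight is the inverse square of the standard error.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * d for w, (d, _) in zip(weights, estimates)) / total
    # Inverse square root of the sum of the weights.
    pooled_se = 1.0 / math.sqrt(total)
    return pooled, pooled_se

# Two made-up calibrations of a common item: 1.0 (SE 0.1) and 1.2 (SE 0.2).
pooled, pooled_se = pool_estimates([(1.0, 0.1), (1.2, 0.2)])
```

In this example the weights are about 100 and 25, so the pooled difficulty sits near 1.04, much closer to the more precise estimate, and the pooled standard error (about 0.09) is smaller than either of the originals.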

Calibrating Several Forms

Establishing an item bank usually requires more items than
can be given in one or two forms. When several forms are
involved, we begin in the same way:

a) Calibrate each form separately, and 

b) Find the link between each pair of forms that have
a common point.

Then, because we are dealing with data, the set of links will not
be consistent. For example, in Figure 2, linking Form A to Form
B, then Form B to Form C, and finally Form C to Form A amounts to
linking Form A to itself, and so the sum of those links should be
zero. However, it can never be exactly zero, so we need a
procedure to resolve the inevitable inconsistencies.

Engelhard and Osberg (1981) give the general least squares 
answer, but a procedure (Wright and Stone, 1979), which gives the 
same result and avoids matrix algebra, is:

1) Construct the matrix of link constants t(i,j) (the 
distance to Form i from Form j). 

2) Fill in a good guess for any link that is missing. 
(Use zero, if you have no better idea.) 

3) Compute row means T(i) for the entire matrix 
exactly as though it were full. (Include the diagonal 
which is always zero.) 

4) Fill in estimates for the missing links computed
as

     t(i,j) = T(i) - T(j)

5) Repeat Steps 3 and 4 until the matrix stabilizes.

6) Translate difficulties on Form i to the center of
the bank by adding the mean for row i:

     ~d(i) = d(i) + T(i)

Figure 2 and Table 1 illustrate this for a network of five forms.
Some attention must always be given to the direction of the
arrows and the signs of the numbers.
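The six steps above can be sketched as a short Python routine. This is an illustration under stated assumptions, not a reproduction of Table 1: the function name, the dictionary layout for the observed links, and the three-form example are all made up, and I assume that reversing a link simply reverses its sign.

```python
def solve_links(n_forms, observed, tol=1e-9, max_iter=10_000):
    """Complete a link matrix by repeated row averaging.

    observed: dict {(i, j): t}, where t is the distance to Form i from
    Form j. Returns (T, full): the row means T(i) and the filled matrix.
    """
    # Steps 1 and 2: build the matrix, with zero as the starting guess
    # for every missing link. The diagonal is always zero.
    full = [[0.0] * n_forms for _ in range(n_forms)]
    known = [[i == j for j in range(n_forms)] for i in range(n_forms)]
    for (i, j), t in observed.items():
        full[i][j], full[j][i] = t, -t   # assumed: reversed link, reversed sign
        known[i][j] = known[j][i] = True

    for _ in range(max_iter):
        # Step 3: row means, exactly as though the matrix were full.
        T = [sum(row) / n_forms for row in full]
        change = 0.0
        for i in range(n_forms):
            for j in range(n_forms):
                if not known[i][j]:
                    new = T[i] - T[j]    # Step 4: estimate a missing link
                    change = max(change, abs(new - full[i][j]))
                    full[i][j] = new
        if change < tol:                 # Step 5: the matrix has stabilized
            break
    T = [sum(row) / n_forms for row in full]
    return T, full

# Three forms; the link between Form 0 and Form 2 was never observed.
T, full = solve_links(3, {(1, 0): 1.5, (1, 2): 1.0})
# Step 6 for an item on Form i: adjusted difficulty = d + T[i]
```

Here the missing link settles at -0.5, the value implied by the two observed links, and the row means sum to zero because the common origin is the center of all the forms.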

In Table 1, I started with zeros for the missing links. It
then took ten steps to stabilize. It would have been more
reasonable and quicker to make some intelligent guesses from Figure
2, say, -2.0 for link AE, -1.4 for link AD, and -1.6 for link BE.
Exactly the same thing could also have been done from Table 1 by,
for example, subtracting CE from CA to obtain an estimate of AE.

The row mean for Row i is the number that should be added to 
the difficulties of all items on Form i to shift them onto the 
common origin, which in this procedure is the center of all the 
forms. There is no magic in this particular origin. Once we have 
established a common origin we can shift it anywhere that is 
convenient for our purposes. 

ANALYSIS OF FIT 

So far this has been nothing more than elementary arithmetic
and good housekeeping. The only hard part is keeping the signs
straight. However, it has all been based on the proposition that
the data conform to the dichotomous Rasch model. If the items
work in a way that is a reasonable approximation to this
proposition, then the estimates of item difficulties are in fact
sample-free and everything we have done is valid and just as easy
as it seems. Otherwise things can become more complicated. We
cannot, however, assume that everything is the way we would like
it; this must be verified in every application.

There are three phases in the fit analysis but the idea is 
the same in each of them: Specific objectivity explicitly refers 
to the freedom from the ability distribution but it also means, 
for any appropriate sample, freedom from age, grade, school, 
race, or sex as well. The fit analysis asks if this appears to be 
the case for the data in hand. 

Phase I: Within Form Fit 

The first point at which the fit analysis must be done is 
when calibrating each form. Ideally this would involve checking 
that the difficulties are invariant with respect to every pos- 
sible subdivision of the sample. This could be done physically by 
dividing up the sample into groups defined by ability, race, sex, 
age, grade, etc., and reestimating the item difficulties within 
each group. Likelihood ratios could then be formed to test the 
equality of different sets of difficulties (Gustafsson, 1978) or 
they can be plotted against each other (Rasch, 1960; Wright, 
1968; Wright and Stone, 1979). 

A useful shortcut is to use the "Between Score Group" and
"Total" fit statistics (as computed by BICAL). Both of these
statistics are based on approximate chi-squares derived from the
unconditional maximum likelihood equations and are easy to
compute.

The between score group analysis automatically divides the
sample by ability and explicitly asks if the empirically obtained
item characteristic curve approximates the required shape. If it
does, then the ability groups agree on the difficulty and we can
be as confident of our estimate as their standard errors permit.

The total fit statistic is an attempt to cover everything 
else without being explicit about it. While it is not partic- 
ularly sensitive to subtle departures from objectivity, it de- 
tects irregularities which threaten the basic meaning of the data 
quite well. 

Formal tests of significance are not of much interest here
for three reasons. It is not clear what the null distributions of
the individual item statistics really are. Even if it were clear,
null distributions would not help much since we really want to do a
series of sequential tests on the items. And we can always make
any amount of irregularity acquire any amount of significance by
adjusting the sample size.

I depend much more on examining plots. Rather than
arbitrarily excluding every item we think is more than two or three
standard errors away from expectation, we can plot the various
fit statistics against each other and look for points in the plot
that are outliers from whatever the distribution is. I am not too
concerned if the distribution is fatter than it is supposed to be
as long as it seems to be one distribution. The distribution
being overweight does influence my opinion of the standard errors,
however.

Items identified as misfitting in this manner are almost
always easily diagnosed if we are willing to look hard enough.
They are items that are miskeyed, that have no right answer, that
have more than one right answer, that have a smart way to find
the wrong answer, or that have an interaction with special
instruction or experience. Recognizing the items that require
investigation from histograms of the fit statistics is straightforward.
Correcting or eliminating them can be done in comfort when we
have discovered the particular events that produced their aberrant
performance. Going beyond this and successively rejecting each
"next worst fitting" item becomes both statistically and
substantively uncertain, with no clear stopping points.

Phase II: Within Link Fit

Once we have satisfied ourselves that the items calibrated
for each form are sufficiently consistent, we can begin linking
the ones with common points. For each replication within the
connecting elements, we have another level of fit analysis. For
common items, we are asking whether the two samples (the one that
took Form A and the one that took Form B) define the same scale.
This is a form of between group fit analysis where the groups are
defined by occasion.

This is most easily investigated in a picture of the link 
made by plotting the two sets of difficulties against each other. 
The points in this plot should follow (within standard errors) a 
straight line with slope one and intercepts t(i,j) and t(j,i). 
Items which stand away from this line do not have "occasion-free" 
calibrations. 
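One way to put numbers behind such a plot is sketched below. It is an illustration only: the function name and the example values are made up, and dividing each residual by the combined standard error (ignoring the error in the link itself) is a simplifying assumption of mine, meant for spotting items that stand apart rather than for a formal test.

```python
import math
from statistics import mean

def link_residuals(d_a, se_a, d_b, se_b):
    """Distance of each common item from the line with slope one.

    d_a, d_b: dicts item -> difficulty on Forms A and B.
    se_a, se_b: dicts item -> standard error of calibration.
    Returns (link, {item: (residual, standardized residual)}).
    """
    common = sorted(set(d_a) & set(d_b))
    # Intercept of the slope-one line: the average difference.
    link = mean(d_a[k] - d_b[k] for k in common)
    out = {}
    for k in common:
        resid = d_a[k] - (d_b[k] + link)
        # Assumed approximation: combine the two calibration errors only.
        se = math.sqrt(se_a[k] ** 2 + se_b[k] ** 2)
        out[k] = (resid, resid / se)
    return link, out

# Made-up example: the third item does not keep its place between forms.
d_a = {"i1": 1.0, "i2": 0.0, "i3": -1.0}
d_b = {"i1": -0.5, "i2": -1.5, "i3": -1.0}
se = {"i1": 0.2, "i2": 0.2, "i3": 0.2}
link, resid = link_residuals(d_a, se, d_b, se)
```

In this example the link is 1.0 logit and item "i3" sits a full logit off the line while the other two sit half a logit off in the opposite direction, so a plot of the residuals would show one item pulling the link away from the rest.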

The analysis of fit can be done in a manner analogous to 
that described for within form. Rather than imposing an absolute 
standard, look for items that are obviously different than the 
others without worrying too much about where the standard error 
control lines actually fall. Items identified by this approach 
are usually easy to explain. 

Items which do not fit in a link usually turn out to be: 

i) different items that were given the same name, 

ii) items that were printed differently in one form, 

iii) items whose answers changed between
administrations, or

iv) items that interact with special experience or
instruction.

The last category deserves some discussion. Ideally, we
would like persons receiving instruction to move forward
on the variable, but we would not like them to disturb our
operational definition of it. If this were really true, then it would
not matter when during instruction the items were calibrated. The
items, however, are only imperfect instances of the variable;
being told the answer to one of them, or even being told how to
solve a special class of them, does not necessarily make a person
better able to deal with every other item.

For example, in a bank of mathematics items recently
constructed at the MESA Psychometric Laboratory (Wright and Stone,
1980), it was found that fraction problems written horizontally
were harder than the same problems written vertically for fifth
graders, but not for sixth graders. An extraneous variable of
practice or familiarity distinguished the two grades with respect
to horizontal items. For the fifth graders, the items had one
difficulty determined jointly by the complexity of the arithmetic
and the unfamiliarity of the format. For the sixth graders, they
had another (lower) difficulty because the format was no longer a
factor.

In this case, the items were included in the bank with the
sixth grade difficulties. This means that when they are given to
a child who has trouble with the horizontal notation, this will
show up in the fit analysis for that child as a cluster of
unexpected failures, diagnostic of this child's particular
deficiency. For fifth graders, this should not be alarming; for older
children, it might warrant some action.

Phase III: Between Link Fit

When we are dealing with a matrix of links (Table 1), we can
take the analysis one step further. Since each entry in the
matrix can be predicted from the margin, i.e., t(i,j) = T(i) - T(j), we
can compute residuals for each of the observed links. The matrix
of residuals can then be summarized in whatever manner interests
us to check if particular forms, levels or samples seem to
present unusual problems (e.g., plot observed against expected).
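A small sketch of the residual calculation follows. The function names and the three-form example are hypothetical, and for brevity the helper that computes the row means assumes every link in the network was observed and that a reversed link has the reversed sign.

```python
def row_means(n_forms, observed):
    """Row means T(i) of a link matrix in which every link is observed."""
    full = [[0.0] * n_forms for _ in range(n_forms)]
    for (i, j), t in observed.items():
        full[i][j], full[j][i] = t, -t   # assumed: reversed link, reversed sign
    return [sum(row) / n_forms for row in full]

def link_residual_table(observed, T):
    """Observed link minus the link predicted from the margin."""
    return {(i, j): t - (T[i] - T[j]) for (i, j), t in observed.items()}

# A made-up triangle of links that fails to close by 0.1 logit:
# 1.5 + (-1.0) + (-0.4) = 0.1 instead of zero.
observed = {(1, 0): 1.5, (2, 1): -1.0, (0, 2): -0.4}
T = row_means(3, observed)
residuals = link_residual_table(observed, T)
```

In this example the 0.1 logit of inconsistency is spread evenly, so each of the three links has the same small residual. A link or form with a residual well away from the others is the one to investigate.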

While I am unable to provide any foolproof rules for
detecting misfitting items or links, I cannot overemphasize the
importance of performing analyses of misfit. When dealing with a
single, fixed form, it is possible to live with a very loose
approximation to specific objectivity. However, as item banks grow
larger and cover wider ranges, even minor departures from
objectivity become important. If you are planning to do
"test-free" measurement, you need a bank well enough constructed to
support it. Also, the careful investigation of items you are
interested in is always instructive. It invariably leads to new
insights into the variable and how people relate to it.

Bank Maintenance

The basic ideas in maintaining a bank are the same as for 
building it. We need to establish a common calibration for all 
items and we need to check routinely that things are working the 
way we want. There are a few new details we should think about 
explicitly. 

Adding New Items 

New items can be added anytime we like. We need only
administer them with some previously calibrated items and use these
as a common point with the bank. This amounts to treating the
bank as though it were a form which has some items in common
with the new form. A link can then be calculated, added to the
difficulties of the new items, and the new records inserted into
the file. Of course, an analysis of fit will be performed on the
old items to assure ourselves that everything is still under
control.

Updating Old Items 

We again have the problem of what to do with the old items
now that we have still another estimate of their difficulties.
There are two schools of thought. We can average in the new
information using the same weighted average as before. Or we can
leave the bank difficulties as they were. I favor the second
approach.

Averaging is appropriate only if we have evidence that the
difficulties have not changed (i.e., the test of fit was
acceptable). In other words, averaging is appropriate only when it is
not necessary (unless, of course, we want to decrease the standard
error).

Averaging will also create some fuzziness about how to
interpret results. A given score on a fixed form will not be
associated with exactly the same ability as it was last year. This
will be hard to explain to people who are trying to use the
results.

Continuous updating of the banked difficulties can have a 
more dangerous aspect. It can obscure small but real drifts in 
the difficulties of some items. If there is a slow but systematic 
change, allowing ourselves to adjust for it automatically may 
keep us from noticing it. 

A more appealing procedure (once we have acceptable standard
errors) is to leave the difficulties where they were in the
original calibration until we have strong evidence that they have
changed. When that happens, we can either drop the item or
substitute the new difficulty.

Dropping Obsolete Items

Once an item has become obsolete, it should be eliminated
from the bank. The question is not what to do but when.
"Obsolete" means that it no longer belongs on the variable we are
measuring. This will be the result of the item failing some phase
of the fit analysis after it has been reused.

The decision of when to update a difficulty or when to drop
an item is rarely obvious. There should be a periodic analysis of
each item's behavior over all its administrations. This is a
between-occasion analysis and requires only that we save the item's
history. When there appear to be differences in the difficulties,
then some action is needed. Whether that action is dropping the
item or repairing our opinion of it will depend on what we think
has happened. This is a substantive question that should be
manageable once the statistical analysis has attracted our attention
to the item.
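As a rough sketch of what such a periodic review of an item's history might compute, consider the following. The summary shown here (deviations from the weighted average, each divided by its own standard error, and their sum of squares) is an assumption of mine for illustration, not a procedure prescribed above, and the function name and data are hypothetical.

```python
def history_summary(history):
    """Summarize one item's calibrations over its administrations.

    history: list of (difficulty, standard_error) pairs, one per
    administration, all already shifted onto the bank origin.
    Returns (center, standardized deviations, sum of their squares).
    """
    weights = [1.0 / se ** 2 for _, se in history]
    center = sum(w * d for w, (d, _) in zip(weights, history)) / sum(weights)
    # How far each occasion sits from the center, in standard error units.
    z = [(d - center) / se for d, se in history]
    spread = sum(v * v for v in z)
    return center, z, spread

# Made-up histories: one steady item and one that has moved a full logit.
steady = history_summary([(0.5, 0.1), (0.5, 0.1), (0.5, 0.1)])
moved = history_summary([(0.0, 0.1), (1.0, 0.1)])
```

As with the other fit statistics, I would plot these summaries across items and look for the ones that stand apart rather than apply a fixed cutoff. A steady item gives a spread near zero, while the second example gives a spread of about 50, which should attract attention.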

Figure 1: Linking Two Forms

Figure 2: Link Network for Five Forms

Table 1: Calculating Links for Several Interconnected Forms